fix: improve access logs#2266

Merged
SkArchon merged 15 commits into main from milinda/eng-6989-improve-error-logs
Oct 13, 2025

Conversation

@SkArchon
Contributor

@SkArchon SkArchon commented Oct 8, 2025

This PR:

  • Introduces an add_stacktrace flag for access_logs, which is true by default. When enabled, any access log entry that carries an error also prints a stack trace.
  • Introduces a level flag for access_logs, which is info by default. When set to warn, for example, access logs below warn won't be printed.

Note:
Previously, stack traces were only emitted for panics. That behaviour is kept even if stack traces are disabled.
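
As a rough illustration of how the two flags interact, here is a minimal stdlib-only sketch. The types `Level`, `AccessLogConfig`, `Entry` and the function `Emit` are hypothetical stand-ins invented for this example; they are not the router's actual types, and the real implementation delegates filtering and stack-trace capture to zap.

```go
package main

import "fmt"

// Level orders log severities from least to most severe.
// Hypothetical type for illustration only.
type Level int

const (
	Debug Level = iota - 1 // -1
	Info                   // 0
	Warn                   // 1
	Error                  // 2
)

// AccessLogConfig is a hypothetical stand-in for the access_logs settings
// described above (level and add_stacktrace).
type AccessLogConfig struct {
	Level         Level
	AddStacktrace bool
}

// Entry is a hypothetical access-log record. An empty Err means the
// request succeeded.
type Entry struct {
	Level Level
	Msg   string
	Err   string
}

// Emit returns the line that would be written, or "" when the entry is
// below the configured threshold and therefore suppressed.
func Emit(cfg AccessLogConfig, e Entry) string {
	if e.Level < cfg.Level {
		return ""
	}
	line := e.Msg
	if e.Err != "" {
		line += " error=" + e.Err
		if cfg.AddStacktrace {
			// A real logger would capture the call stack here.
			line += " stacktrace=<captured>"
		}
	}
	return line
}

func main() {
	cfg := AccessLogConfig{Level: Warn, AddStacktrace: true}
	fmt.Printf("%q\n", Emit(cfg, Entry{Level: Info, Msg: "/graphql"}))
	fmt.Printf("%q\n", Emit(cfg, Entry{Level: Error, Msg: "/graphql", Err: "boom"}))
}
```

With the threshold at warn, the info entry is dropped and only the error entry is written, with a stack trace marker attached because add_stacktrace is on.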

Summary by CodeRabbit

  • New Features

    • Access logs: configurable log level (debug/info/warn/error/panic/fatal) and optional stack traces per config.
    • Per-request log level handler so requests and subgraph entries are logged at appropriate levels.
    • Subgraph access logger supports error-level entries.
  • Bug Fixes

    • Errors now logged at error level instead of info.
    • Broken-pipe errors omit stack traces and include a broken_pipe indicator.
  • Tests / Docs

    • Expanded tests and updated config/schema defaults for access_logs (level, add_stacktrace).

@SkArchon SkArchon marked this pull request as ready for review October 8, 2025 18:43
@github-actions github-actions Bot added the router label Oct 8, 2025
@coderabbitai
Contributor

coderabbitai Bot commented Oct 8, 2025

Walkthrough

Thread configurable access-log level and optional stacktrace through logging constructors and config/schema; add per-request log level handling and SubgraphAccessLogger.Error; change engine OnFinished to log errors at ERROR; and expand tests for logging, stacktraces, and broken-pipe handling.

Changes

| Cohort / File(s) | Summary of changes |
| --- | --- |
| Logging core & constructors: router/pkg/logging/logging.go | Add stacktrace flag to logger constructors and core options; thread level and stacktrace into access-log constructors; add BufferedLoggerOptions.StackTrace; update signatures (New, defaultZapCoreOptions, NewZapLoggerWithCore, NewZapLogger, NewZapAccessLogger). |
| Startup & plan generator call sites: router/cmd/main.go, router/cmd/plan_generator.go | Update calls to logging.New(...) to pass new stacktrace boolean (sourced from config or set at call site). |
| Supervisor & logger initialization: router/core/supervisor_instance.go, router-tests/testenv/testenv.go | Parse AccessLogs.Level into zapcore.Level; propagate parsed level and AddStacktrace into buffered/non-buffered access-log constructors; update testenv logger constructor signatures. |
| Configuration & schema: router/pkg/config/config.go, router/pkg/config/config.schema.json, router/pkg/config/testdata/config_defaults.json, router/pkg/config/testdata/config_full.json | Add Level string and AddStacktrace bool to AccessLogsConfig; add level and add_stacktrace to JSON schema; update testdata defaults. |
| Request logger runtime & handlers: router/internal/requestlogger/requestlogger.go, router/core/graph_server.go, router/core/request_context_fields.go, router/core/engine_loader_hooks.go | Add WithLogLevelHandler option and logLevelHandler field; add LogLevelHandler(r *http.Request) zapcore.Level; use handler to call h.logger.Log(level, ...); wire option into router; OnFinished logs at ERROR when response had an error. |
| Subgraph logger & tests: router/internal/requestlogger/subgraphlogger.go, router/internal/requestlogger/subgraphlogger_test.go, router-tests/structured_logging_test.go | Add SubgraphAccessLogger.Error(...); update tests and NewZapLoggerWithCore call sites for new stacktrace parameter; introduce MyBrokenPipeModule test helper; expand tests for access-log level filtering, stacktrace behavior, broken-pipe handling, and emitted log fields. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title Check | ✅ Passed | The title “fix: improve access logs” relates directly to the PR’s goal of enhancing access logging by adding configurable stacktrace inclusion and log level filtering, and it concisely identifies the affected subsystem. While it could be more specific about which aspects of access logs are improved, it still accurately summarizes the primary change. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
✨ Finishing touches
  • 📝 Generate docstrings

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 398ba9b and 726888e.

📒 Files selected for processing (1)
  • router/core/supervisor_instance.go (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • router/core/supervisor_instance.go
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
  • GitHub Check: build-router
  • GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
  • GitHub Check: build_test
  • GitHub Check: build_push_image
  • GitHub Check: image_scan (nonroot)
  • GitHub Check: build_push_image (nonroot)
  • GitHub Check: image_scan
  • GitHub Check: Analyze (go)

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions

github-actions Bot commented Oct 8, 2025

Router-nonroot image scan passed

✅ No security vulnerabilities found in image:

ghcr.io/wundergraph/cosmo/router:sha-9fe1704369ee5c857cdcb713a4d95dd6b39e0c92-nonroot

Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
router/pkg/config/testdata/config_full.json (1)

389-390: Consider showcasing non-defaults in full config to exercise behavior

Use this file to demonstrate filtering and toggle:

  • Set Level to "warn" (or higher) to show access-log suppression below threshold.
  • Optionally set AddStacktrace to false to cover both branches in tests.
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 10fd2a4 and ea8e76c.

📒 Files selected for processing (3)
  • router/core/graph_server.go (1 hunks)
  • router/pkg/config/testdata/config_defaults.json (1 hunks)
  • router/pkg/config/testdata/config_full.json (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • router/core/graph_server.go
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
  • GitHub Check: build-router
  • GitHub Check: image_scan (nonroot)
  • GitHub Check: build_push_image (nonroot)
  • GitHub Check: image_scan
  • GitHub Check: build_push_image
  • GitHub Check: integration_test (./telemetry)
  • GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
  • GitHub Check: integration_test (./events)
  • GitHub Check: build_test
  • GitHub Check: Analyze (go)
  • GitHub Check: Analyze (javascript-typescript)
🔇 Additional comments (1)
router/pkg/config/testdata/config_defaults.json (1)

191-192: Access-logs defaults verified — schema and struct align; re-evaluate stacktrace default

  • Schema (router/pkg/config/config.schema.json) and Go struct (router/pkg/config/config.go: AccessLogsConfig) both set level="info" and add_stacktrace=true; testdata (router/pkg/config/testdata/config_defaults.json) uses exported JSON names (Level/AddStacktrace) — expected.
  • Level parsing: router/core/supervisor_instance.go uses level.Set(strings.ToUpper(cfg.AccessLogs.Level)) — zapcore parses levels case‑insensitively and the schema enum restricts allowed values.
  • Security/privacy: default add_stacktrace=true can leak sensitive data in access logs — confirm compliance requirements. If not acceptable, change schema and AccessLogsConfig default to false or add explicit redaction/guardrails where stacktraces are emitted.

@SkArchon SkArchon changed the title from "fix: improve error logs" to "fix: improve access logs" Oct 8, 2025
Comment thread router/pkg/config/config.go Outdated
Comment thread router/pkg/config/config.schema.json Outdated
Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
router-tests/structured_logging_test.go (1)

240-240: Consider using explicit log level constant.

The calls pass 0 for the log level parameter, which corresponds to zapcore.InfoLevel (zapcore.DebugLevel is -1). While functionally correct, using the named constant explicitly would improve code clarity.

Apply this change for better readability:

-		logger := logging.NewZapAccessLogger(f, 0, false, false, true)
+		logger := logging.NewZapAccessLogger(f, zapcore.InfoLevel, false, false, true)

Also applies to: 343-343

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ea8e76c and a8d7261.

📒 Files selected for processing (4)
  • router-tests/structured_logging_test.go (7 hunks)
  • router/internal/requestlogger/requestlogger.go (5 hunks)
  • router/pkg/config/config.go (1 hunks)
  • router/pkg/config/config.schema.json (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • router/pkg/config/config.schema.json
  • router/pkg/config/config.go
🧰 Additional context used
🧬 Code graph analysis (2)
router/internal/requestlogger/requestlogger.go (1)
router/core/batch.go (1)
  • Handler (46-58)
router-tests/structured_logging_test.go (4)
router/core/modules.go (3)
  • EnginePreOriginHandler (122-126)
  • Module (52-54)
  • ModuleInfo (44-50)
router/core/context.go (1)
  • RequestContext (61-137)
router/pkg/logging/logging.go (2)
  • New (21-23)
  • NewZapAccessLogger (109-127)
router/core/router.go (3)
  • Option (172-172)
  • WithCustomModules (1762-1766)
  • WithHeaderRules (1726-1730)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
  • GitHub Check: build-router
  • GitHub Check: build_test
  • GitHub Check: build_push_image
  • GitHub Check: build_push_image (nonroot)
  • GitHub Check: integration_test (./events)
  • GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
  • GitHub Check: integration_test (./telemetry)
  • GitHub Check: image_scan (nonroot)
  • GitHub Check: image_scan
  • GitHub Check: build_test
  • GitHub Check: Analyze (go)
  • GitHub Check: Analyze (javascript-typescript)
🔇 Additional comments (11)
router/internal/requestlogger/requestlogger.go (5)

52-58: LGTM! Clean separation of concerns.

The new panicLogger field cleanly separates panic-specific logging (which should always include stacktraces) from regular access logs. The logLevelHandler function field provides flexible per-request log level control.


115-119: LGTM! Well-designed option.

The new WithLogLevelHandler option follows the existing functional options pattern and enables flexible per-request log level configuration.


127-138: LGTM! Correct initialization.

The panicLogger is properly initialized with zap.AddStacktrace(zapcore.ErrorLevel), ensuring that panic logs always include stacktraces regardless of the global stacktrace configuration. This is the correct behavior for panic recovery.


175-175: LGTM! Correct panic handling.

The panic logging correctly uses panicLogger for regular panics (which includes stacktraces) and suppresses stacktraces for broken pipe errors by temporarily raising the stacktrace level to PanicLevel.

Also applies to: 177-177


201-206: LGTM! Dynamic log level implementation.

The dynamic log level handling is implemented correctly, defaulting to InfoLevel when no handler is configured and allowing per-request customization via the logLevelHandler function.

router-tests/structured_logging_test.go (6)

7-7: LGTM! Appropriate test dependencies.

The new imports (io, net, syscall, zaptest/observer) provide necessary functionality for simulating broken pipe errors and observing log output in tests.

Also applies to: 9-9, 14-14, 23-23


65-90: LGTM! Proper broken pipe simulation.

The MyBrokenPipeModule correctly simulates a broken pipe error by creating a net.OpError with syscall.EPIPE, which matches the error type checked in the panic recovery logic.
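
The following is a minimal sketch of how such an error can be constructed and recognised using only the standard library. The helper names `newBrokenPipeError` and `isBrokenPipe` are hypothetical, and the use of `errors.Is` is an assumption: the router's recovery logic may instead inspect the error with type assertions.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"os"
	"syscall"
)

// newBrokenPipeError builds the kind of error a write to a closed
// connection typically surfaces: a *net.OpError wrapping EPIPE.
func newBrokenPipeError() error {
	return &net.OpError{
		Op:  "write",
		Net: "tcp",
		Err: &os.SyscallError{Syscall: "write", Err: syscall.EPIPE},
	}
}

// isBrokenPipe reports whether err wraps syscall.EPIPE anywhere in its
// chain. Both net.OpError and os.SyscallError implement Unwrap, so
// errors.Is can walk through them.
func isBrokenPipe(err error) bool {
	return errors.Is(err, syscall.EPIPE)
}

func main() {
	fmt.Println(isBrokenPipe(newBrokenPipeError()))
	fmt.Println(isBrokenPipe(errors.New("unrelated failure")))
}
```

A recovery handler could use a check like this to decide whether to skip the stack trace and set a broken_pipe field instead.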


3609-3717: LGTM! Comprehensive log level testing.

The test suite thoroughly validates the new access log level configuration feature:

  1. Default behavior (InfoLevel for successful requests)
  2. Level filtering (logs below configured level are suppressed)
  3. Error escalation (validation and subgraph errors logged at ErrorLevel)

The test at lines 3633-3659 demonstrates proper use of the observer pattern to verify that logs below the configured level are filtered out.


3719-3845: LGTM! Thorough panic stacktrace testing.

The tests correctly verify panic stacktrace behavior:

  1. Default: Panics always include stacktraces
  2. Override protection: Even when stacktraces are explicitly disabled, panics still include them (correct behavior for debugging)
  3. Broken pipe exception: Broken pipe errors don't include stacktraces to avoid log spam, and properly set the broken_pipe field

The assertion message at line 3797 clearly documents the expected behavior.


3847-3951: LGTM! Validates breaking change behavior.

These tests thoroughly verify the new stacktrace behavior for error access logs:

  • Validation errors: Include stacktraces when enabled (lines 3847-3868), omit when disabled (lines 3870-3898)
  • Subgraph errors: Include stacktraces when enabled (lines 3900-3921), omit when disabled (lines 3923-3951)
  • Error logging: All errors correctly logged at ErrorLevel

This validates the documented breaking change: stacktraces are now included for all error access logs (not just panics) when the feature is enabled.


3953-3972: LGTM! Verifies successful requests remain clean.

This test confirms that successful requests don't include stacktraces, ensuring the feature only activates for error conditions.

Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
router/core/supervisor_instance.go (1)

77-80: Consider improving the error message phrasing.

The error message "could not parse log level: %w for access logs" places "for access logs" after the error placeholder, which reads awkwardly. Consider rephrasing to "could not parse access log level: %w" or "could not parse log level for access logs: %w" for better clarity.

Otherwise, the level parsing logic is correct and properly handles error cases.
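
To make the phrasing point concrete, here is a small stdlib-only sketch of level parsing with the context placed before the wrapped error. The `Level` type, `parseLevel`, `accessLogLevel` and `errUnknownLevel` are hypothetical names for this example; the actual code parses into zapcore.Level.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Level is a local stand-in for zapcore.Level (illustration only).
type Level int

const (
	Debug Level = iota - 1
	Info
	Warn
	Error
)

var errUnknownLevel = errors.New("unrecognized level")

// parseLevel maps a level name to a Level, ignoring case.
func parseLevel(s string) (Level, error) {
	switch strings.ToLower(s) {
	case "debug":
		return Debug, nil
	case "info":
		return Info, nil
	case "warn":
		return Warn, nil
	case "error":
		return Error, nil
	}
	return Info, fmt.Errorf("%w: %q", errUnknownLevel, s)
}

// accessLogLevel wraps parse failures with the context first and the
// underlying error last, so the message reads naturally.
func accessLogLevel(raw string) (Level, error) {
	lvl, err := parseLevel(raw)
	if err != nil {
		return Info, fmt.Errorf("could not parse log level for access logs: %w", err)
	}
	return lvl, nil
}

func main() {
	if _, err := accessLogLevel("loud"); err != nil {
		fmt.Println(err)
	}
}
```

Because the cause is wrapped with `%w`, callers can still match it with `errors.Is` while the message keeps the context in front.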

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between da3a046 and 98e82f1.

📒 Files selected for processing (2)
  • router-tests/testenv/testenv.go (2 hunks)
  • router/core/supervisor_instance.go (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • router-tests/testenv/testenv.go
🧰 Additional context used
🧬 Code graph analysis (1)
router/core/supervisor_instance.go (2)
router/pkg/mcpserver/util.go (1)
  • Logger (6-9)
router/pkg/logging/logging.go (1)
  • NewZapAccessLogger (109-127)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
  • GitHub Check: build-router
  • GitHub Check: image_scan
  • GitHub Check: build_push_image (nonroot)
  • GitHub Check: build_push_image
  • GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
  • GitHub Check: image_scan (nonroot)
  • GitHub Check: build_test
  • GitHub Check: Analyze (javascript-typescript)
  • GitHub Check: Analyze (go)
🔇 Additional comments (1)
router/core/supervisor_instance.go (1)

88-121: LGTM! Level and stacktrace parameters correctly threaded.

The implementation correctly threads the parsed log level and stacktrace configuration to all four logger instantiation paths (buffered/non-buffered × file/stdout). The parameter order matches the function signatures, and the logic is consistent across all code paths.

@SkArchon SkArchon merged commit 3ef065d into main Oct 13, 2025
46 of 48 checks passed
@SkArchon SkArchon deleted the milinda/eng-6989-improve-error-logs branch October 13, 2025 09:16